CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model
Supervised crowd counting relies heavily on manual labeling, which is
difficult and expensive, especially in dense scenes. To alleviate this problem,
we propose a novel unsupervised framework for crowd counting, named CrowdCLIP.
The core idea is built on two observations: 1) the recent contrastive
pre-trained vision-language model (CLIP) has presented impressive performance
on various downstream tasks; 2) there is a natural mapping between crowd
patches and count text. To the best of our knowledge, CrowdCLIP is the first
work to exploit vision-language knowledge for the counting problem.
Specifically, in the training stage, we exploit a multi-modal ranking loss,
constructing ranking text prompts that match size-sorted crowd patches to
guide the learning of the image encoder. In the testing stage, to deal with the
diversity of image patches, we propose a simple yet effective progressive
filtering strategy that first selects highly likely crowd patches and then
maps them into the language space with various counting intervals. Extensive
experiments on five challenging datasets demonstrate that the proposed
CrowdCLIP achieves superior performance compared to previous unsupervised
state-of-the-art counting methods. Notably, CrowdCLIP even surpasses some
popular fully-supervised methods under the cross-dataset setting. The source
code will be available at https://github.com/dk-liang/CrowdCLIP.
Comment: Accepted by CVPR 2023.
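Below is a minimal sketch of the multi-modal ranking idea the abstract describes: size-sorted crowd patches should match count-ordered text prompts in CLIP's joint embedding space. The encoder interfaces, prompt wording, and margin value are illustrative assumptions, not the authors' released implementation.

```python
# A hedged sketch of a CLIP-style multi-modal ranking loss, assuming
# size-sorted crowd patches and count-ordered ranking prompts.
import torch
import torch.nn.functional as F

def ranking_loss(image_encoder, text_encoder, patches, prompts, margin=0.1):
    """patches[i] is the i-th smallest crowd patch; prompts[i] is its
    matching ranking prompt, e.g. 'There are around {k_i} persons.'"""
    img = F.normalize(image_encoder(patches), dim=-1)  # (N, D)
    txt = F.normalize(text_encoder(prompts), dim=-1)   # (N, D)
    sim = img @ txt.t()                                # (N, N) cosine similarities
    loss = img.new_zeros(())
    # Each patch should be closer to its matched prompt than to any
    # lower-ranked prompt, by at least `margin`.
    for i in range(sim.size(0)):
        for j in range(i):
            loss = loss + F.relu(margin - (sim[i, i] - sim[i, j]))
    return loss / max(sim.size(0), 1)
```

Only the image encoder would be updated by this loss, consistent with the abstract's statement that the prompts guide the image encoder's learning.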
SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model
With the development of large language models, many remarkable linguistic
systems like ChatGPT have thrived and achieved astonishing success on many
tasks, showing the incredible power of foundation models. In the spirit of
unleashing the capability of foundation models on vision tasks, the Segment
Anything Model (SAM), a vision foundation model for image segmentation, has
been proposed recently and presents strong zero-shot ability on many downstream
2D tasks. However, whether SAM can be adapted to 3D vision tasks has yet to be
explored, especially 3D object detection. With this inspiration, we explore
adapting the zero-shot ability of SAM to 3D object detection in this paper. We
propose a SAM-powered BEV processing pipeline to detect objects and get
promising results on the large-scale Waymo open dataset. As an early attempt,
our method takes a step toward 3D object detection with vision foundation
models and presents the opportunity to unleash their power on 3D vision tasks.
The code is released at https://github.com/DYZhang09/SAM3D.
Comment: Technical Report.
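A minimal sketch of what a SAM-powered BEV pipeline could look like, using the public `segment_anything` API. The rasterization grid, the intensity channel at `points[:, 3]`, and the checkpoint path are assumptions; the authors' actual pipeline may differ.

```python
# Hedged sketch: rasterize lidar into a BEV pseudo-image, then let SAM
# segment it zero-shot and read detections off the mask bounding boxes.
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

def lidar_to_bev(points, x_range=(-40, 40), y_range=(-40, 40), res=0.1):
    """Rasterize lidar intensity into a 3-channel uint8 BEV pseudo-image."""
    w = int((x_range[1] - x_range[0]) / res)
    h = int((y_range[1] - y_range[0]) / res)
    bev = np.zeros((h, w), dtype=np.float32)
    xs = ((points[:, 0] - x_range[0]) / res).astype(int)
    ys = ((points[:, 1] - y_range[0]) / res).astype(int)
    keep = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    np.maximum.at(bev, (ys[keep], xs[keep]), points[keep, 3])  # assumed intensity
    bev = (255 * bev / max(bev.max(), 1e-6)).astype(np.uint8)
    return np.stack([bev] * 3, axis=-1)  # SAM expects an HxWx3 image

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # assumed path
mask_gen = SamAutomaticMaskGenerator(sam)

def detect_bev_boxes(points):
    bev = lidar_to_bev(points)
    masks = mask_gen.generate(bev)  # zero-shot segmentation of the BEV image
    # Each mask's (x, y, w, h) pixel box is a BEV detection; lifting to 3D
    # would add height statistics from the points inside each mask.
    return [m["bbox"] for m in masks]
```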
Paint and Distill: Boosting 3D Object Detection with Semantic Passing Network
The 3D object detection task from lidar or camera sensors is essential for
autonomous driving. Pioneering attempts at multi-modality fusion complement the
sparse lidar point clouds with rich semantic texture information from images at
the cost of extra network designs and overhead. In this work, we propose a
novel semantic passing framework, named SPNet, to boost the performance of
existing lidar-based 3D detection models under the guidance of rich context
painting, at no extra computation cost during inference. Our key design is to
first exploit the potential instructive semantic knowledge within the
ground-truth labels by training a semantic-painted teacher model and then guide
the pure-lidar network to learn the semantic-painted representation via
knowledge passing modules at different granularities: class-wise passing,
pixel-wise passing and instance-wise passing. Experimental results show that
the proposed SPNet can seamlessly cooperate with most existing 3D detection
frameworks with 1~5% AP gain and even achieve new state-of-the-art 3D detection
performance on the KITTI test benchmark. Code is available at:
https://github.com/jb892/SPNet.
Comment: Accepted by ACM MM 2022.
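A hedged sketch of the three knowledge-passing granularities the abstract names, framed as teacher-to-student distillation from the semantic-painted model to the pure-lidar one. The feature shapes, loss choices (MSE/KL), and weights are assumptions for illustration, not SPNet's exact formulation.

```python
# Sketch of class-wise, pixel-wise, and instance-wise passing losses.
import torch
import torch.nn.functional as F

def pixel_passing(student_bev, teacher_bev):
    """Pixel-wise passing: match student BEV features to the painted teacher."""
    return F.mse_loss(student_bev, teacher_bev.detach())

def class_passing(student_logits, teacher_logits, T=2.0):
    """Class-wise passing: distill softened classification distributions."""
    p_t = F.softmax(teacher_logits.detach() / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def instance_passing(student_feats, teacher_feats, fg_mask):
    """Instance-wise passing: align features only inside object regions."""
    return F.mse_loss(student_feats[fg_mask], teacher_feats[fg_mask].detach())

def spnet_style_loss(det_loss, s, t, w=(1.0, 0.5, 0.5)):
    """Total objective: detection loss plus the three passing terms."""
    return (det_loss
            + w[0] * pixel_passing(s["bev"], t["bev"])
            + w[1] * class_passing(s["cls"], t["cls"])
            + w[2] * instance_passing(s["bev"], t["bev"], s["fg_mask"]))
```

The teacher is only used at training time, which is consistent with the abstract's claim of no extra inference cost.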
SOOD: Towards Semi-Supervised Oriented Object Detection
Semi-Supervised Object Detection (SSOD), aiming to explore unlabeled data for
boosting object detectors, has become an active task in recent years. However,
existing SSOD approaches mainly focus on horizontal objects, leaving
multi-oriented objects that are common in aerial images unexplored. This paper
proposes a novel Semi-supervised Oriented Object Detection model, termed SOOD,
built upon the mainstream pseudo-labeling framework. Towards oriented objects
in aerial scenes, we design two loss functions to provide better supervision.
Focusing on the orientations of objects, the first loss regularizes the
consistency between each pseudo-label-prediction pair (a prediction and its
corresponding pseudo-label) with adaptive weights based on their orientation
gap. Focusing on the layout of an image, the second loss regularizes the
similarity between the sets of pseudo-labels and predictions and explicitly
builds their many-to-many relation. Such a global consistency
constraint can further boost semi-supervised learning. Our experiments show
that when trained with the two proposed losses, SOOD surpasses the
state-of-the-art SSOD methods under various settings on the DOTA-v1.5
benchmark. The code will be available at https://github.com/HamPerdredes/SOOD.
Comment: Accepted to CVPR 2023.
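A hedged sketch of what the first, orientation-focused loss could look like: a consistency term whose weight grows with the orientation gap between a prediction and its pseudo-label. The box parameterization and weighting form are assumptions; the second, layout-level loss (a global many-to-many matching between the two box sets) is omitted here.

```python
# Sketch of an orientation-gap-weighted consistency loss on matched pairs.
import torch

def rotation_weighted_consistency(pred_boxes, pseudo_boxes):
    """pred_boxes, pseudo_boxes: (N, 5) matched oriented boxes, each
    parameterized as (cx, cy, w, h, angle) with angle in radians."""
    # Orientation gap wrapped into [0, pi/2].
    gap = torch.remainder(pred_boxes[:, 4] - pseudo_boxes[:, 4], torch.pi)
    gap = torch.minimum(gap, torch.pi - gap)
    # Larger orientation disagreement -> larger adaptive weight, so the
    # loss focuses supervision on pairs whose angles drift apart.
    weight = 1.0 + gap / (torch.pi / 2)
    per_pair = torch.abs(pred_boxes - pseudo_boxes).sum(dim=-1)  # L1 on params
    return (weight * per_pair).mean()
```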
SGM3D: Stereo Guided Monocular 3D Object Detection
Monocular 3D object detection aims to predict the object location, dimension
and orientation in 3D space alongside the object category given only a
monocular image. It poses a great challenge due to its ill-posed nature: the
2D image plane critically lacks depth information. While there exist
approaches leveraging off-the-shelf depth estimation or relying on LiDAR
sensors to mitigate this problem, the dependence on an additional depth model
or expensive equipment severely limits their scalability to generic 3D
perception. In this paper, we propose a stereo-guided monocular 3D object
detection framework, dubbed SGM3D, adapting the robust 3D features learned from
stereo inputs to enhance the features for monocular detection. We innovatively
present a multi-granularity domain adaptation (MG-DA) mechanism to exploit the
network's ability to generate stereo-mimicking features given only monocular
cues. Both the coarse BEV feature-level and the fine anchor-level domain
adaptation are leveraged for guidance in the monocular domain. In
addition, we introduce an IoU matching-based alignment (IoU-MA) method for
object-level domain adaptation between the stereo and monocular predictions to
alleviate the mismatches while adopting the MG-DA. Extensive experiments
demonstrate state-of-the-art results on the KITTI and Lyft datasets.
Comment: 8 pages, 5 figures.
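A hedged sketch of the two adaptation granularities plus an IoU-based matching step in the spirit of MG-DA and IoU-MA. Tensor shapes, the matching threshold, and the `iou_fn` callable are assumptions, not SGM3D's exact formulation.

```python
# Sketch: stereo branch is frozen as a teacher; the monocular branch
# imitates it at BEV-feature and matched-anchor level.
import torch
import torch.nn.functional as F

def bev_feature_adaptation(mono_bev, stereo_bev):
    """Coarse level: monocular BEV features mimic frozen stereo features."""
    return F.mse_loss(mono_bev, stereo_bev.detach())

def iou_match(mono_boxes, stereo_boxes, iou_fn, thr=0.5):
    """IoU-MA-style matching: keep well-overlapping prediction pairs only,
    to avoid aligning features of mismatched objects."""
    ious = iou_fn(mono_boxes, stereo_boxes)        # (N_m, N_s) IoU matrix
    best_iou, best_j = ious.max(dim=1)
    keep = best_iou > thr
    return torch.stack([torch.nonzero(keep).squeeze(1), best_j[keep]], dim=1)

def anchor_level_adaptation(mono_anchor_feats, stereo_anchor_feats, matched):
    """Fine level: align features of anchors matched across the branches."""
    m = mono_anchor_feats[matched[:, 0]]
    s = stereo_anchor_feats[matched[:, 1]].detach()
    return F.smooth_l1_loss(m, s)
```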
Diffusion-based 3D Object Detection with Random Boxes
3D object detection is an essential task for achieving autonomous driving.
Existing anchor-based detection methods rely on empirical, heuristic anchor
settings, which makes the algorithms inelegant. In recent years, we have
witnessed the rise of several generative models, among which diffusion models
show great potential for learning the transformation of two distributions. Our
proposed Diff3Det migrates the diffusion model to proposal generation for 3D
object detection by considering the detection boxes as generative targets.
During training, the object boxes diffuse from the ground truth boxes to the
Gaussian distribution, and the decoder learns to reverse this noise process. In
the inference stage, the model progressively refines a set of random boxes to
the prediction results. We provide detailed experiments on the KITTI benchmark
and achieve promising performance compared to classical anchor-based 3D
detection methods.
Comment: Accepted by PRCV 2023.
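A hedged sketch of the box-diffusion training signal the abstract describes: ground-truth boxes are diffused toward Gaussian noise and a decoder learns to reverse the process. The linear schedule and x_0-prediction parameterization are common DDPM choices assumed here, not necessarily Diff3Det's exact design.

```python
# Sketch of the forward (noising) process on detection boxes.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # assumed linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(gt_boxes, t, noise=None):
    """Forward process: noisy boxes x_t from ground-truth boxes x_0.
    gt_boxes: (B, N, box_dim) normalized box parameters; t: (B,) timesteps."""
    if noise is None:
        noise = torch.randn_like(gt_boxes)
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1)
    s = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1)
    return a * gt_boxes + s * noise

# Training step (sketch): the decoder learns to recover x_0 from x_t.
#   t = torch.randint(0, T, (batch,))
#   x_t = q_sample(gt_boxes, t)
#   pred_boxes = decoder(features, x_t, t)   # hypothetical decoder
#   loss = detection_loss(pred_boxes, gt_boxes)
# Inference starts from purely random boxes and iteratively refines them,
# matching the abstract's description of the reverse process.
```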
Microstructure and mechanical properties of refill friction stir spot welded joints: Effects of tool size and welding parameters
A novel refill friction stir spot welding (RFSSW) technique employing large-sized tools is proposed. The microstructure and mechanical properties of joints produced with a large-sized welding tool and a conventional tool are compared. The results show that, for joints produced by both the conventional and the novel tool, the exit line resulting from the sleeve becomes longer with increasing plunge depth, and the diameter of the nugget increases with higher rotational speed. As the plunge depth increases from 2.0 mm to 2.2 mm and then to 2.4 mm, the hook defect bends upwards, lies almost parallel to the lap interface, and bends downwards, respectively. The joints produced with the novel tool have a flatter hook than those produced with the conventional tool. The microstructure evolution of the conventional and novel joints is similar. The tensile-shear and tearing forces measured for the novel joints are higher than those of the conventional joints under the same welding parameters. For conventional joints, the maximum tensile-shear and tearing forces are 8.6 ± 0.1 kN and 4.4 ± 0.2 kN, respectively; for novel joints, they are 10.9 ± 0.1 kN and 5.6 ± 0.1 kN. After the tensile-shear test, three fracture modes are observed: upper-mixed fracture, lower-mixed fracture, and shear fracture. The plunge depth has a pronounced effect on the fracture mode of the joints.